cider-d score
Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Kambara, Motonari, Sugiura, Komei
Domestic service robots (DSRs) that naturally communicate with users to support household tasks are a promising solution for elderly or disabled people. DSRs are expected to perform most tasks autonomously, and so they could damage objects and themselves. It therefore would be useful if they could explain the potential risks associated with their actions through natural language. However, a DSR's ability to generate natural language explanations is still insufficient. In this paper, we propose the Relational Future Captioning Model (RFCM). The RFCM can generate captions that take into account the relationship between past events. This is because it has a source-target attention structure that generates appropriate captions for future events from the relationships between events. In this structure, the features derived from past clips are used as a source, and the features derived from ...
B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
Bujimalla, Shashank, Subedar, Mahesh, Tickoo, Omesh
Bayesian deep neural networks (DNNs) provide a mathematically grounded framework to quantify uncertainty in their predictions. We propose a Bayesian variant of a policy-gradient-based reinforcement learning training technique for image captioning models to directly optimize non-differentiable image captioning quality metrics such as CIDEr-D. We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference, and refer to it as B-SCST. The "baseline" reward for the policy gradients in B-SCST is generated by averaging the predictive quality metrics (CIDEr-D) of the captions drawn from the distribution obtained using a Bayesian DNN model. This predictive distribution is inferred using Monte Carlo (MC) dropout, which is one of the standard ways to approximate variational inference. We observe that B-SCST improves all the standard captioning quality scores on both the Flickr30k and MS COCO datasets, compared to the SCST approach. We also provide a detailed study of uncertainty quantification for the predicted captions, and demonstrate that it correlates well with the CIDEr-D scores. To our knowledge, this is the first such analysis, and it can pave the way to more practical image captioning solutions with interpretable models.
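The baseline described in the B-SCST abstract can be illustrated with a minimal sketch. It is one reading of the abstract, not the authors' implementation: the function names `bscst_baseline` and `policy_gradient_loss` are hypothetical, the CIDEr-D scorer and the captioning model are left out, and the reward values in the usage lines are made up for illustration only.

```python
from statistics import mean


def bscst_baseline(mc_rewards):
    """Average the rewards (e.g. CIDEr-D) of captions sampled with MC dropout.

    Assumption: each entry is the score of one caption drawn from a
    stochastic forward pass with dropout left active.
    """
    return mean(mc_rewards)


def policy_gradient_loss(sample_log_prob, sample_reward, baseline_reward):
    """REINFORCE-with-baseline surrogate loss for a single sampled caption.

    A caption that scores above the baseline gets its log-probability
    pushed up when this loss is minimized; one below gets pushed down.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sample_log_prob


# Illustrative numbers only (not results from the paper).
mc_rewards = [0.5, 1.0, 1.5]           # scores of three MC-dropout captions
baseline = bscst_baseline(mc_rewards)  # 1.0
loss = policy_gradient_loss(sample_log_prob=-2.0,
                            sample_reward=1.5,
                            baseline_reward=baseline)  # 1.0
```

For comparison, standard SCST would pass the reward of the greedily decoded caption as `baseline_reward`; in this sketch only the way the baseline is computed changes.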